OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A probabilistic approach to printed document understanding

Internal identifier: 000579 (Main/Exploration); previous: 000578; next: 000580

Authors: Eric Medvet [Italy]; Alberto Bartoli [Italy]; Giorgio Davanzo [Italy]

Source: International journal on document analysis and recognition (Print), 2011; ISSN 1433-2833

RBID : Pascal:12-0083104

Abstract

We propose an approach to information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes (a class being a set of documents that share similar content and layout) is large and may evolve over time. Describing a new class is simple: the operator provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document that contain the information to be extracted. Our approach is probabilistic: we derive a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of that class. All these parameters depend only on block properties that can be extracted automatically from the operator's actions on the GUI. Processing a document of a given class consists of finding the sequence of blocks that maximizes the corresponding probability for that class. We evaluated our proposal experimentally on 807 multi-page printed documents from different domains (invoices, patents, data sheets), obtaining very good results: for example, a success rate often greater than 90% even for classes with just two samples.
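
The abstract does not give the form of the probability, so the following Python sketch only illustrates the general idea under strong simplifying assumptions: a single block is searched rather than a sequence of blocks, each block is reduced to five numeric properties, and every property is modeled by an independent Gaussian. The `Block` class, the feature list, and the `min_std` floor are hypothetical choices made for this illustration and are not the authors' model.

```python
import math
from dataclasses import dataclass


@dataclass
class Block:
    """An OCR block reduced to a few numeric properties (hypothetical choice)."""
    page: int
    x: float
    y: float
    width: float
    height: float
    text: str = ""


FEATURES = ("page", "x", "y", "width", "height")


def fit(samples, min_std=1.0):
    """Maximum likelihood estimate of one Gaussian (mean, std) per feature.

    min_std is an assumed floor that avoids a zero standard deviation
    when all clicked samples share the same value.
    """
    n = len(samples)
    params = {}
    for name in FEATURES:
        values = [getattr(b, name) for b in samples]
        mean = sum(values) / n
        var = sum((v - mean) ** 2 for v in values) / n  # ML variance divides by n
        params[name] = (mean, max(math.sqrt(var), min_std))
    return params


def log_likelihood(block, params):
    """Sum of per-feature Gaussian log densities for one candidate block."""
    total = 0.0
    for name, (mean, std) in params.items():
        z = (getattr(block, name) - mean) / std
        total += -0.5 * z * z - math.log(std) - 0.5 * math.log(2 * math.pi)
    return total


def extract(blocks, params):
    """Return the candidate block with the highest log-likelihood."""
    return max(blocks, key=lambda b: log_likelihood(b, params))


# Blocks the operator clicked in two sample documents of one class (made-up values).
samples = [Block(1, 400, 700, 80, 12), Block(1, 404, 710, 84, 12)]

# OCR blocks of a new document of the same class (made-up values).
candidates = [
    Block(1, 50, 40, 200, 20, "ACME Corp"),
    Block(1, 401, 706, 81, 12, "1,250.00"),
    Block(2, 400, 700, 80, 12, "Page 2"),
]

best = extract(candidates, fit(samples))
print(best.text)  # prints "1,250.00"
```

Working with log-likelihoods rather than raw probabilities keeps the comparison numerically stable; the search over sequences of blocks described in the abstract would replace the simple `max` over single blocks used here.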



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author>
<name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">12-0083104</idno>
<date when="2011">2011</date>
<idno type="stanalyst">PASCAL 12-0083104 INIST</idno>
<idno type="RBID">Pascal:12-0083104</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000106</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000666</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000133</idno>
<idno type="wicri:doubleKey">1433-2833:2011:Medvet E:a:probabilistic:approach</idno>
<idno type="wicri:Area/Main/Merge">000585</idno>
<idno type="wicri:Area/Main/Curation">000579</idno>
<idno type="wicri:Area/Main/Exploration">000579</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author>
<name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>DEEI, University of Trieste, Via A. Valerio 10</s1>
<s2>34127 Trieste</s2>
<s3>ITA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Italie</country>
<wicri:noRegion>34127 Trieste</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2011">2011</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Character recognition</term>
<term>Click</term>
<term>Content management</term>
<term>Document analysis</term>
<term>Document processing</term>
<term>Image interpretation</term>
<term>Information extraction</term>
<term>Information retrieval</term>
<term>Maximum likelihood</term>
<term>Modeling</term>
<term>Optical character recognition</term>
<term>Patent rights</term>
<term>Patents</term>
<term>Printed document</term>
<term>Probabilistic approach</term>
<term>Upgrading</term>
<term>User interface</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Interprétation image</term>
<term>Analyse documentaire</term>
<term>Extraction information</term>
<term>Gestion contenu</term>
<term>Interface utilisateur</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Recherche information</term>
<term>Traitement document</term>
<term>Document imprimé</term>
<term>Clic</term>
<term>Brevet</term>
<term>Propriété industrielle</term>
<term>Valorisation</term>
<term>Approche probabiliste</term>
<term>Maximum vraisemblance</term>
<term>Modélisation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Brevet</term>
<term>Propriété industrielle</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We propose an approach for information extraction for multi-page printed document understanding. The approach is designed for scenarios in which the set of possible document classes, i.e., documents sharing similar content and layout, is large and may evolve over time. Describing a new class is a very simple task: the operator merely provides a few samples and then, by means of a GUI, clicks on the OCR-generated blocks of a document containing the information to be extracted. Our approach is based on probability: we derived a general form for the probability that a sequence of blocks contains the searched information. We estimate the parameters for a new class by applying the maximum likelihood method to the samples of the class. All these parameters depend only on block properties that can be extracted automatically from the operator actions on the GUI. Processing a document of a given class consists in finding the sequence of blocks, which maximizes the corresponding probability for that class. We evaluated experimentally our proposal using 807 multi-page printed documents of different domains (invoices, patents, data-sheets), obtaining very good results- e.g., a success rate often greater than 90% even for classes with just two samples.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Italie</li>
</country>
</list>
<tree>
<country name="Italie">
<noRegion>
<name sortKey="Medvet, Eric" sort="Medvet, Eric" uniqKey="Medvet E" first="Eric" last="Medvet">Eric Medvet</name>
</noRegion>
<name sortKey="Bartoli, Alberto" sort="Bartoli, Alberto" uniqKey="Bartoli A" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<name sortKey="Davanzo, Giorgio" sort="Davanzo, Giorgio" uniqKey="Davanzo G" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
</country>
</tree>
</affiliations>
</record>
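
As a minimal sketch of how such a record could be read programmatically, the Python snippet below extracts the title, the authors, and the English keywords with the standard library. It works on a trimmed copy of the record. The `xmlns:wicri` declaration and its URI are placeholders added only so that the `wicri:` prefix parses, since the record as displayed declares no namespaces; the `summarize` function name is likewise an assumption.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record. The xmlns:wicri URI is a placeholder (assumption):
# the displayed record uses the prefix without declaring it.
RECORD = """<record xmlns:wicri="urn:example:wicri">
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A probabilistic approach to printed document understanding</title>
<author>
<name sortKey="Medvet, Eric" first="Eric" last="Medvet">Eric Medvet</name>
<affiliation wicri:level="1"><country>Italie</country></affiliation>
</author>
<author>
<name sortKey="Bartoli, Alberto" first="Alberto" last="Bartoli">Alberto Bartoli</name>
<affiliation wicri:level="1"><country>Italie</country></affiliation>
</author>
<author>
<name sortKey="Davanzo, Giorgio" first="Giorgio" last="Davanzo">Giorgio Davanzo</name>
<affiliation wicri:level="1"><country>Italie</country></affiliation>
</author>
</titleStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Information extraction</term>
<term>Maximum likelihood</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Extraction information</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def summarize(xml_text):
    """Return the title, author names, and English keywords of a record."""
    root = ET.fromstring(xml_text)
    title_stmt = root.find(".//titleStmt")
    title = title_stmt.findtext("title")
    authors = [name.text for name in title_stmt.findall("author/name")]
    keywords = [
        term.text
        for group in root.iter("keywords")
        if group.get(XML_LANG) == "en"
        for term in group.findall("term")
    ]
    return {"title": title, "authors": authors, "keywords": keywords}


summary = summarize(RECORD)
print(summary["authors"])
```

Reading the authors from `titleStmt` only avoids counting them twice, because the full record repeats the same author entries under `sourceDesc`.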

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000579 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000579 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:12-0083104
   |texte=   A probabilistic approach to printed document understanding
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024